Notes: Moira made a scatterplot of perceived audience size against actual audience size. She found that most posts had a much larger actual audience than the perceived one.
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
Notes: Let’s examine two variables of the pseudo Facebook data by using a scatterplot. We will plot friend count against age.
library(ggplot2)
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point()
Response:
* The youngest and oldest users show the widest spread in friend counts. The spread at the oldest ages may indicate that younger people registered with a false, older age.
* The majority of people with high friend counts are younger than 26.
* Certain ages beyond 50 have higher friend counts than their neighbors: 69, 103, and 108.
* The majority of users of all ages have fewer than 1000 friends.
***
Notes: This section is about switching from qplot to ggplot syntax. I have been using ggplot this whole time, so there is no change.
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point() +
xlim(13,90) +
scale_x_continuous(breaks=seq(13,90,5))
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
Notes: When most of the points are close together, like those for users under 25, it can be hard to interpret the graph. Let’s make it easier by adding a transparency value to the scatterplot.
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point(alpha=1/20) +
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Adjusting the transparency makes it much easier to notice that most users have fewer than 1000 friends. This was not easy to tell with the original scatterplot.
You may have noticed that the points line up in vertical columns, one for each age, because age is recorded as a whole number. We can add noise to the plot by using geom_jitter.
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_jitter(alpha=1/20) +
xlim(13,90)
## Warning: Removed 5168 rows containing missing values (geom_point).
Response: Most of the high friend counts for ages above 30 are gone. The younger users no longer show such high friend counts.
***
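What jittering does can be sketched outside ggplot2 with base R's jitter(), which adds a small amount of uniform random noise to each value. The ages below are made-up values rather than rows from the data set, and the amount of 0.4 is an assumption chosen so that each point stays closer to its own integer age than to a neighboring one.

```r
# Toy ages (hypothetical values, not taken from pseudo_facebook.tsv)
ages <- c(13, 13, 14, 14, 14, 15, 20, 20, 35)

set.seed(1)  # make the random noise reproducible

# With a positive `amount`, jitter() adds uniform noise in [-amount, amount]
jittered_ages <- jitter(ages, amount = 0.4)

# The largest shift is below 0.4, so rounding recovers the original ages
max_shift <- max(abs(jittered_ages - ages))
recovered <- round(jittered_ages)

print(data.frame(ages, jittered_ages))
```

Because the noise is smaller than half the gap between neighboring ages, the columns spread out without mixing users of different ages.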
Notes: coord_trans transforms the coordinate system of the plot, for example to a log10 or sqrt scale. The difference between coord_trans and a scale transformation is that a scale transformation is applied before any statistics are computed, while coord_trans is applied after the statistics are calculated.
ggplot(data=pf, aes(x=age, y=friend_count)) +
geom_point(alpha=1/20) +
xlim(13,90) +
coord_trans(y= "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
High friend counts at the older ages become even sparser.
***
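The order of transformation and summary matters because taking the square root before averaging gives a different number than averaging first and taking the square root afterwards. The sketch below uses made-up friend counts to show the gap. For a plain scatterplot with no summary layer the distinction should not change where the points land; it matters once a layer computes a statistic.

```r
# Toy friend counts (hypothetical values, not taken from pseudo_facebook.tsv)
friend_counts <- c(0, 1, 4, 9, 400, 2500)

# Transform first, then summarise (the order a scale transformation uses)
mean_of_sqrt <- mean(sqrt(friend_counts))

# Summarise first, then transform (the order coord_trans uses)
sqrt_of_mean <- sqrt(mean(friend_counts))

print(c(mean_of_sqrt = mean_of_sqrt, sqrt_of_mean = sqrt_of_mean))
```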
Notes:
ggplot(data=pf, aes(x=age, y=friendships_initiated)) +
geom_point(alpha=1/15, position=position_jitter(h=0)) +
scale_x_continuous(breaks=seq(13,90,3), limits=c(13,90)) +
coord_trans(y='sqrt')+
xlab('Age') +
ylab('Number of Friendships Initiated')
## Warning: Removed 5196 rows containing missing values (geom_point).
Notes: The data can also be transformed with knowledge of the data domain to avoid overplotting. In her study of perceived vs actual audience size, Moira took percentages related to total friend count to avoid overplotting.
Notes: We will use the dplyr package to group the data frame by age and summarize friend count within each age group.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_group <- group_by(pf, age)
pf.fc_by_age <- summarize(age_group, mean_friends = mean(friend_count), median_friends = median(friend_count), n= n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age, 6)
## # A tibble: 6 × 4
##     age mean_friends median_friends     n
##   <int>        <dbl>          <dbl> <int>
## 1    13     164.7500           74.0   484
## 2    14     251.3901          132.0  1925
## 3    15     347.6921          161.0  2618
## 4    16     351.9371          171.5  3086
## 5    17     350.3006          156.0  3283
## 6    18     331.1663          162.0  5196
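As a cross-check on what the grouped summary computes, the same three columns can be built with base R's tapply(). This is only a sketch on a made-up data frame; the name `toy` and its values are hypothetical and stand in for `pf`.

```r
# Toy data frame (hypothetical values, not taken from pseudo_facebook.tsv)
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14, 15),
  friend_count = c(10, 20, 90, 5, 15, 100)
)

# One row per age with the mean, median, and group size,
# mirroring the mean_friends, median_friends, and n columns above
fc_by_age <- data.frame(
  age            = sort(unique(toy$age)),
  mean_friends   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  median_friends = as.vector(tapply(toy$friend_count, toy$age, median)),
  n              = as.vector(tapply(toy$friend_count, toy$age, length))
)

print(fc_by_age)
```

The gap between mean and median at age 13 in the toy data comes from a single large value, which is the same reason the mean sits above the median in the real summary.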
Create your plot!
ggplot(data=pf.fc_by_age, aes(x=age, y=mean_friends)) +
geom_line(color='blue')+
ylab('Mean Number of Friends')
Notes: Plots can be overlaid with their summaries to make data interpretation easier. We will plot the same scatterplot as before, except now we will overlay line plots that show summaries.
ggplot(data=pf, aes(x=age, y=friend_count)) +
geom_point(alpha=1/20,color='orange', position=position_jitter(h=0)) +
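# Summary lines per age; `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
geom_line(stat='summary', fun=quantile, fun.args=list(probs=0.9), color='blue') + # 90th percentile
geom_line(stat='summary', fun=quantile, fun.args=list(probs=0.1), color='red', linetype=2) + # 10th percentile
geom_line(stat='summary', fun=mean, color='black', linetype=3) + # mean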
coord_cartesian(xlim=c(13,90), ylim=c(0,1000)) +
scale_x_continuous(breaks=seq(13,90,3)) +
ylab('Friend Count')
Response: The mean and the 90th percentile are below 1000 friends for all ages. There is a spike at age 69 and erratic spikes beyond age 79.
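The percentile lines in the plot are per-age quantiles of friend count. A minimal sketch of that calculation in base R, on a made-up data frame (`toy` and its values are hypothetical), could look like this:

```r
# Toy data: two ages with 11 users each (hypothetical values)
toy <- data.frame(
  age          = rep(c(20, 30), each = 11),
  friend_count = c(0:10 * 10, 0:10 * 100)
)

# 10th and 90th percentile of friend count within each age;
# extra arguments to tapply() are passed on to quantile()
p10 <- tapply(toy$friend_count, toy$age, quantile, probs = 0.1)
p90 <- tapply(toy$friend_count, toy$age, quantile, probs = 0.9)

print(data.frame(age = c(20, 30), p10 = as.vector(p10), p90 = as.vector(p90)))
```

Eighty percent of the users at each age fall between the two values, which is the band the red and blue lines enclose.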
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes: Correlation is the covariance of two variables divided by the product of their standard deviations. A correlation of 1 or -1 indicates a perfect linear relationship between the variables, while a correlation of 0 indicates no linear relationship. We will find the correlation between age and friend count.
cor.test(pf$age, pf$friend_count)
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
with(pf, cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
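The definition of correlation as covariance divided by the product of the standard deviations can be checked by hand. The vectors below are made-up values, used only to show that the manual formula agrees with cor().

```r
# Toy vectors (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# Pearson correlation from its definition
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Same quantity from the built-in function
r_builtin <- cor(x, y)

print(c(r_manual = r_manual, r_builtin = r_builtin))
```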
Notes: Looking at the plot we generated earlier, the relationship between age and friend count seems to break down after age 70. Let’s look at the same relationship but exclude users who are older than 70.
with(subset(pf, age <=70),cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Notes: cor.test supports three correlation methods: Pearson (the default), Kendall, and Spearman. Pearson measures linear association, while Kendall and Spearman are rank-based.
Notes: Let’s test our knowledge of correlation by creating a scatterplot of two variables that have a strong correlation: likes received vs. desktop likes received.
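The difference between the three methods shows up on data that is monotonic but not linear. In this sketch on made-up values, the rank-based methods report a perfect relationship while Pearson does not.

```r
# Toy data: y increases with x, but not along a straight line
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")   # linear association
spearman <- cor(x, y, method = "spearman")  # rank-based
kendall  <- cor(x, y, method = "kendall")   # rank-based, from concordant pairs

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

The same `method` argument is accepted by cor.test().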
ggplot(data=pf, aes(x=www_likes_received, y=likes_received)) +
geom_point(alpha = 1/50, color='red')
Notes: We will now zoom in on the data by making the 95th percentile of each variable the upper limit of its axis. This way, the graph shows roughly the bottom 95% of the data. We will also add a line to visualize the correlation.
ggplot(data=pf, aes(x=www_likes_received, y=likes_received)) +
geom_point(alpha=1/25, color='red') +
coord_cartesian(xlim=c(0,quantile(pf$www_likes_received,0.95)), ylim=c(0, quantile(pf$likes_received, 0.95))) +
geom_smooth(method='lm')
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
with(pf, cor.test(www_likes_received, likes_received))
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948
Notes: Strong correlations aren’t always a good thing, because if two variables are strongly correlated, it may not be clear which one is driving the other.
Notes:
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Create your plot!
ggplot(data=Mitchell, aes(x=Month, y= Temp)) +
geom_point()
Guess: 0
cor.test(Mitchell$Month, Mitchell$Temp)
##
## Pearson's product-moment correlation
##
## data: Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes: Month is stored as a running count of months starting from January 1976. To see the seasonal pattern, we want the months on a repeating 12-month cycle.
ggplot(data=Mitchell, aes(x=Month, y=Temp)) +
geom_point() +
scale_x_continuous(breaks = seq(0,203, 12))
ggplot(data=Mitchell, aes(x=(Month%%12), y=Temp)) +
geom_point(color='brown')
What do you notice? Response: It’s a sine wave
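The effect of the modulo operation, and the reason the correlation was close to zero despite the clear pattern, can be sketched with synthetic data. The sine-shaped series below is an assumption standing in for the Mitchell temperatures; it is not the real data.

```r
# Running month index, the same length as the Mitchell data (204 months)
month_index <- 0:203

# Folding onto a 12-month cycle: every value lands in 0..11
month_in_year <- month_index %% 12

# Synthetic seasonal series with a 12-month period (not the real temperatures)
temp <- sin(2 * pi * month_index / 12)

# A linear correlation with the running index is close to zero,
# even though temp is fully determined by the month of the year
r_linear <- cor(month_index, temp)
print(r_linear)

# Each of the 12 positions in the cycle occurs 17 times
counts <- table(month_in_year)
print(counts)
```

Pearson correlation only measures linear association, so a strong periodic relationship can hide behind a value near zero.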
Watch the solution video and check out the Instructor Notes!
Notes: The nature of the data should suggest the shape of the graph. If it doesn’t, a good default is a graph that is about 50% wider than it is tall.
Notes: Let’s measure age in years with a fractional component for the birth month, so that each age splits into 12 smaller bins, to see the effect of noise in graphs.
pf$age_with_months <- with(pf, age+(12-dob_month)/12)
age_group_months <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarize(age_group_months, mean_friends = mean(friend_count), median_friends = median(as.numeric(friend_count)), n= n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
ggplot(data=subset(pf.fc_by_age_months, age_with_months < 71), aes(x=age_with_months, y=mean_friends)) +
geom_line() +
scale_x_continuous(breaks = seq(0,70, 2)) +
xlab('Age (year)')
Notes: The age-with-months graph has a lot more noise than the age graph. That’s because the bins are smaller and each bin holds fewer users, so the mean varies more from bin to bin. Let’s take a closer look.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(data=subset(pf.fc_by_age, age < 71), aes(x=age, y=mean_friends)) +
geom_line(color='blue')+
ylab('Mean Number of Friends')
p2 <- ggplot(data=subset(pf.fc_by_age_months, age_with_months < 71), aes(x=age_with_months, y=mean_friends)) +
geom_line() +
scale_x_continuous(breaks = seq(0,70, 2)) +
xlab('Age (year)')+
ylab('Mean Number of Friends')
grid.arrange(p1,p2, ncol=1)
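The extra noise in the by-month graph can be reproduced with simulated data. In this sketch every simulated user draws a friend count from the same distribution, so any variation between bin means is pure sampling noise. The sample size, the age range, and the exponential distribution are all assumptions made for illustration.

```r
set.seed(42)  # make the simulation reproducible

n_users      <- 24000
age_years    <- sample(20:29, n_users, replace = TRUE)
dob_month    <- sample(1:12, n_users, replace = TRUE)
friend_count <- rexp(n_users, rate = 1 / 200)  # same distribution for everyone

# Same formula as in the lesson: age in years plus a fraction for the birth month
age_with_months <- age_years + (12 - dob_month) / 12

# Mean friend count per yearly bin and per monthly bin
mean_by_year  <- tapply(friend_count, age_years, mean)
mean_by_month <- tapply(friend_count, age_with_months, mean)

# Spread of the bin means: larger for the smaller monthly bins
sd_year  <- sd(mean_by_year)
sd_month <- sd(mean_by_month)
print(c(sd_year = sd_year, sd_month = sd_month))
```

Each monthly bin holds about one twelfth of the users in a yearly bin, so its mean is estimated from less data and moves around more.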
Both of these graphs can be smoothed by adding a smoothing layer (geom_smooth) in ggplot. However, the smoother isn’t perfect and does not capture all of the nuances in the relationship.
p1 <- ggplot(data=subset(pf.fc_by_age, age < 71), aes(x=age, y=mean_friends)) +
geom_line(color='blue')+
geom_smooth()+
scale_x_continuous(breaks = seq(0,70, 2)) +
ylab('Mean Number of Friends')
p2 <- ggplot(data=subset(pf.fc_by_age_months, age_with_months < 71), aes(x=age_with_months, y=mean_friends)) +
geom_smooth()+
geom_line() +
scale_x_continuous(breaks = seq(0,70, 2)) +
xlab('Age (year)')+
ylab('Mean Number of Friends')
grid.arrange(p1,p2, ncol=1)
***
Notes: We do not have to pick a single plot. We should create different visualizations of the same data to get different insights.
We should decide on one or two visualizations when it comes time to communicate our findings.
Reflection: Scatterplots, Conditional Means, adding jitter to scatter plots with discrete data, using dplyr to group data and summarize, overlaying line plots onto scatter plots, dealing with overplotting by using alpha and knowledge of the data set, correlation, dealing with noisy data
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!